Model Report for `dataset_1`

generated on 26 Mar 2024, 23:46

Dataset
Original Samples: 10,000
Synthetic Samples: 10,000
Target Columns: 14
Accuracy: 87.9% (98.6%)
Univariate: 92.0%
Bivariate: 83.8%
Distances
Identical Matches: 0.0% (0.0%)
Average Distances: 1.92 (1.87)

Correlations

Univariate Distributions

Bivariate Distributions
Accuracy

Column                    Univariate  Bivariate
PWF                       100.0%      91.3%
RNF                       99.9%       91.4%
HDF                       99.8%       91.3%
OSF                       99.8%       91.3%
TWF                       99.8%       91.3%
Type                      99.6%       90.9%
Machine failure           99.4%       91.0%
Tool wear [min]           99.0%       89.4%
UDI                       98.9%       89.6%
Process temperature [K]   98.5%       89.3%
Air temperature [K]       98.4%       89.2%
Rotational speed [rpm]    98.0%       88.9%
Torque [Nm]               97.5%       88.6%
Product ID                0.0%        0.0%
Total                     92.0%       83.8%

Explainer
The accuracy of the synthetic data is assessed by comparing the distributions of the synthetic data (shown in green) with those of the original data (shown in gray). For each distribution plot, the deviations are summed across all categories to give the total variation distance (TVD). The accuracy is then reported as 100% - TVD. These accuracies are calculated for all univariate and bivariate distributions, and the final accuracy score is the average across all of them.
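The calculation described above can be sketched for a single categorical column as follows. This is a minimal illustration, not the report generator's actual code: the function name `tvd_accuracy` and the sample values are hypothetical, and the sketch assumes the conventional TVD definition (half the sum of absolute differences between category shares), which the explainer does not spell out.

```python
from collections import Counter


def tvd_accuracy(original, synthetic):
    """Return 1 - TVD between the category shares of two samples.

    Assumption: TVD is half the sum of absolute differences in
    relative frequency, taken over every category seen in either sample.
    """
    p = Counter(original)
    q = Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for a category that is missing from one sample.
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd


# Hypothetical values for a column with three categories.
original = ["L", "L", "L", "M", "M", "H"]
synthetic = ["L", "L", "M", "M", "M", "H"]
accuracy = tvd_accuracy(original, synthetic)
print(f"{accuracy:.1%}")
```

Under this definition two samples with no categories in common score 0%, and identical samples score 100%. The headline figure in this report is consistent with a simple mean of the two totals: (92.0% + 83.8%) / 2 = 87.9%.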

Distances


Identical Matches: 0.0% (0.0%)

Average Distances: 1.92 (1.87)



Explainer
Synthetic data should be close, but not too close, to the original data in order to preserve the confidentiality of the original samples. This can be assessed by checking for exact matches between synthetic and original records, and by measuring the distance from each synthetic record to its closest original record. These statistics are then compared against the same statistics observed within the original data itself and tested for statistical significance. In addition, their distributions are visualized above, with the distances for the synthetic data displayed in green and the distances for the original data displayed in gray. A green line that lies significantly to the left of the gray line in the cumulative density plots implies that the generated data is too close to the actual records.
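The two statistics described above can be sketched as follows. This is an illustration under stated assumptions rather than the method used to produce this report: the report does not name its distance metric or how records are encoded, so the sketch assumes purely numeric records and Euclidean distance. The function names and sample records are hypothetical.

```python
import math


def closest_distances(synthetic, original):
    """For each synthetic record, return the distance to its nearest original record.

    Assumption: records are numeric tuples compared with Euclidean distance.
    """
    return [min(math.dist(s, o) for o in original) for s in synthetic]


def identical_match_share(synthetic, original):
    """Return the share of synthetic records that exactly equal some original record."""
    originals = {tuple(o) for o in original}
    return sum(tuple(s) in originals for s in synthetic) / len(synthetic)


# Hypothetical two-column records.
original = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0), (10.0, 13.0)]

dcr = closest_distances(synthetic, original)
share = identical_match_share(synthetic, original)
average = sum(dcr) / len(dcr)
print(dcr, share, average)
```

The brute-force nearest-record search compares every synthetic record with every original record, which is fine for a small example; at 10,000 by 10,000 records a nearest-neighbour index would be the more practical choice.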